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Abstract 

A  branch  of  Complexity  Theory  called  Information-Based  Complexity  Theory 
(IBCT),  produces  an  abundance  of  impressive  results  about  the  quest  for  approximate  solu¬ 
tions  to  mathematical  problems.  Why  then  do  most  numerical  analysts  turn  a  cold  shoulder 
to  IBCT?  Close  analysis  of  two  papers  representative  of  IBCT’s  best  efforts  reveals  a  mix¬ 
ture  of  nice  new  observations,  misdirected  examples  and  misleading  theorems. 

Some  elements  in  the  framework  of  IBCT,  erected  to  support  a  rigorous  yet  flexible 
theory,  make  it  difficult  to  judge  whether  a  model  is  off-target  or  reasonably  realistic.  For 
instance,  a  sharp  distinction  is  made  between  information  and  algorithms  restricted  to  this 
information.  Yet  the  information  itself  usually  comes  from  an  algorithm  and  so  the  distinc¬ 
tion  clouds  the  issues  and  can  lead  to  true  but  misleading  inferences. 

Another  troublesome  aspect  of  IBCT  is  a  free  parameter  F,  the  class  of  admissible 
problem  instances,  whose  membership  fee  is  completely  ignored  in  ascertaining  the  cost  of 
solving  the  worst  case  in  F.  Sometimes  this  leads  to  unrealistic  models. 

We  conclude  that  one’s  satisfaction  with  each  result  of  IBCT  must  be  inversely  pro¬ 
portional  to  what  one  knows  about  the  problem.  The  surprising  results  known  to  us  pertain 
only  to  unnatural  simations  and  IBCT’s  genuinely  new  insights  might  serve  us  better  if 
expressed  in  the  conventional  mode  of  error  bounds  and  approximation  theory. 
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Abstract 

Some  critical  comments  on  information  based  complexity  theoiy-  (IBCT)  are  offered. 
They  may  help  to  explain  why  most  numerical  analysts  have  turned  a  cold  shoulder  to  this 
particular  brand  of  Complexity  Theory. 

The  output  of  IBCT  is  abundant,  impressive,  and  appears  to  address  the  same  sort  of 
problem  that  interests  numerical  analysts;  the  quest  for  approximate  rather  than  exact  solu¬ 
tions.  However  a  careful  examination  of  two  papers  reveals  a  different  state  of  affairs.  We 
find  a  mixmre  of  repackaged  error  bounds,  nice  new  observations,  misdirected  examples 
and  misleading  theorems. 

Our  conclusion  is  that,  in  these  cases,  the  framework  of  IBCT,  erected  to  permit  a 
rigorous  theoretical  development,  makes  it  difficult  to  tell  when  the  models  are  off  target 
and  when  they  are  reasonably  realistic.  The  less  one  knows  about  a  particular  problem  the 
easier  it  is  to  be  satisfied  with  the  IBCT  results.  It  seems  that  the  genuinely  new  insights 
may  be  expressed  better  in  the  conventional  mode  of  approximation  theory  and  error 
bounds  while  the  surprising  theorems  turn  out  to  apply  to  unnatural  situations. 

IBCT  is  flexible  and  it  embraces  distoned  models  as  easily  as  it  does  genuine  ones. 
There  are  several  sources  of  trouble  including 

(i)  a  misleading  distinction  between  information  and  algorithms  ,  and 

(ii)  a  free  parameter,  the  class  F ,  whose  membership  fee  is  completely  ignored. 
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1.  Introduction  and  Summary 

The  incentive  to  write  this  essay  came  from  discussions  held  during  the  workshop  at 
the  Mathematical  Sciences  Research  Institute  (Berkeley)  in  January,  1986  under  the  title 
‘Problems  relating  computer  science  to  numerical  analysis’. 

In  1980  J.  Traub  and  H.  Wozniakowski  published  a  monograph  entitled  ‘A  General 
Theory  of  Optimal  Algorithms’,  which  initiated  a  new  line  of  research.  The  subject  was 
initially  called  analytic  complexity  theory  but  is  now  referred  to  as  information-based  com¬ 
plexity  theory  (IBCT  hereafter).  TTie  August  1987  issue  of  the  Bulletin  of  the  A.M.S.  con¬ 
tains  a  summary  of  recent  results,  [Pac&Wo,1987]  so  acceptance  of  this  branch  of  com¬ 
plexity  theory  has  been  swift. 

One  purpose  of  the  general  theory  is  to  provide  an  attractive  setting  for  the  systematic 
study  of  the  sort  of  problems  that  have  engaged  numerical  analysts  for  decades.  One 
among  the  programs  of  IBCT  is  to  determine  the  minimal  cost  of  computing  an  approxi¬ 
mate  solution  (of  given  accuracy)  over  all  algorithms  that  one  could  use  which  restrict 
themselves  to  certain  limited  information  about  the  data.  It  is  also  of  interest  to  discover 
any  algorithms  that  achieve  this  minimal  cost  or,  at  least,  come  close  to  it.  Pride  of  place 
in  IBCT  is  given  to  the  information  to  which  the  algorithms  are  limited.  By  its  choice  of 
problems  IBCT  is  (potentially)  a  branch  of  complexity  theory  that  is  highly  relevant  to 
numerical  analysts.  Whenever  the  minimal  costs  can  be  estimated  they  provide  a  yardstick 
against  which  to  measure  actual  algorithms.  It  seems  to  be  an  attractive  program. 

IBCT  is  careful  to  distinguish  itself  from  the  older  specialty  now  called  arithmetic 
complexity  which  is  concerned  with  discrete  problems  such  as  the  minimal  number  of  basic 
arithmetic  operations  required  to  form  the  product  of  two  nxn  matrices.  See,  for  example, 
[Ra,1972],  [Str,1969],  [Sch&Str,1971]  and  [Wi,1970].  Arithmetic  complexity  is  a  deep  and 
important  field.  Another  way  to  model  some  aspects  of  scientific  computing  was  intro¬ 
duced  in  1988  by  L.  Blum,  M.  Shub  and  S.  Smale,  [Bl,Sh&Sm,1988];  Algebraic  Complex¬ 
ity  allows  computation  over  rings,  in  particular  the  real  number  field. 

The  purpose  of  this  paper  is  to  sound  a  warning  about  IBCT,  not  these  other  branches 
of  complexity.  The  next  few  paragraphs  point  out  why  our  task  is  less  than  straightforw'ard 
and  why  our  observations  have  more  than  local  interest. 

First,  unlike  some  other  disciplines,  mathematics  lacks  a  tradition  of  public  criticism 
and  there  is  a  chance  that  our  criticisms  will  be  construed  (wrongly)  as  an  attack  on  the 
technical  ability  of  the  founders  of  IBCT.  Second,  any  objections  must  focus  on  specific 
work.  Here  we  discuss  two  related  papers  which  we  believe  to  be  typical:  [Tr&Wo,84]  and 
[Ku,86].  They  are  examined  in  detail  in  Sections  4  and  5.  IBCT  covers  .r.any  areas 
besides  the  matrix  computations  considered  here,  but  we  concentrate  on  vnat  we  know 
best.  Third,  our  task  is  complicated  by  the  fact  that  both  papers  contain  observations  that 
are  new  and  of  independent  technical  interest.  Both  are  written  in  a  professional  manner 
and  the  analysis  is  not  weak.  Our  claim  is  that  the  results,  most  of  them,  are  seriously 
misleading. 

Since  the  arguments  in  the  papers  are  impeccable  the  flaw  must  be  in  the  framework. 
Yet  the  definitions  are  laid  out  plainly  for  all  to  see  and  seem  to  be  appropriate—  a  puz¬ 
zling  situation. 

One  source  of  difficulty  is  the  redefinition  of  common  terms  such  as  ‘eigenvalue  prob¬ 
lem’  (see  Sect.  2(E))  or  ‘worst  case’  (see  Sect.  2(C))  or  ‘information’  (see  Sect.  4.4)  or 
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‘algorithm’  (sec  Sect.  2(A)  &  3.2  &  4.4  &  5.3).  These  slight  twists  to  conventional  mean¬ 
ings  are  subtle  enough  to  escape  notice  but  entail  significant  consequences.  The  results 
mislead  because  they  are  remembered  and  discussed  in  terms  of  ordinary  word  usage.  Most 
readers  will  not  even  be  aware  of  the  shifts  in  meaning,  some  of  which  are  due  to  the 
tempting  but  artificial  distinction  between  information  and  algorithm. 

Another  feature  of  IBCT  that  can  sometimes  rob  results  of  their  relevance  is  the  pres¬ 
ence  of  a  free  parameter:  the  class  F  from  which  worst  cases  are  to  be  drawn.  The  cost  of 
testing  membership  in  F  is  ignored  by  IBCT  and  so  the  model  loses  validity  whenever  this 
cost  is  not  negligible. 

Some  of  our  criticisms  require  little  knowledge  of  the  subject  matter.  These  are 
presented  in  the  next  section.  After  that  we  get  down  to  details,  provide  some  background 
material  and  then  examine  each  paper  in  turn.  In  our  summaries  we  try  to  be  fair  but  we 
encourage  the  interested  reader  to  compare  our  effort  with  the  original  work. 

A  handful  of  reservations  about  IBCT  have  appeared  in  print.  The  reviewer  of  the  ori¬ 
ginal  1980  General  Theory  of  Optimal  Algorithms  in  the  SIAM  Review  saw  no  blemish  in 
the  models  generated  by  IBCT,  [Pac,1986].  In  a  review  of  the  second  monograph  M. 
Shub  [Shu,  1987]  gives  a  couple  of  instances  of  unnatural  measures  of  cost.  In  [Sm,1985] 
S.  Smale  makes  a  penetrating  observation  that  the  sharp  distinction  between  information 
and  algorithm  may  be  too  rigid  to  reflect  the  nature  of  the  approximation  problems  that 
IBCT  investigates.  In  fact  this  sharp  distinction  is  a  cause  of  subtle  changes  in  the  use  of 
the  word  algorithm  mentioned  in  an  earlier  paragraph.  See  Section  3.2  for  details.  In 
[Ba,1987]  1.  Babushka  calls  on  researchers  in  IBCT  to  make  their  models  more  realistic. 
We  concur  but  note  that  there  are  at  least  two  ways  in  which  a  model  may  fail  to  be  realis¬ 
tic.  On  one  hand  all  that  may  be  needed  is  a  relaxation  of  some  assumptions;  on  the  other 
hand  the  model  may  be  so  flexible  as  to  embrace  pointless  investigations  as  readily  as  per¬ 
tinent  ones. 

We  make  no  complaint  that  IBCT  ignores  the  roundoff  error  that  afflicts  implementa¬ 
tion  on  digital  computers.  First  of  all  a  good  understanding  of  computation  in  exact  arith¬ 
metic  is  a  prerequisite  for  tackling  practical  issues.  Secondly  we  must  acknowledge  that  a 
large  part  of  theoretical  numerical  analysis  confines  itself  to  the  comforts  of  exact  arith¬ 
metic. 

IBCT  has  already  produced  a  large  body  of  results,  some  of  them  surprising  and  con¬ 
sequently  of  potential  interest.  Yet  each  surprising  result  known  to  us,  in  worst-case 
analysis,  holds  only  within  a  model  sufficiently  unnatural  as  to  forfeit  attention  from 
numerical  analysts.  This  is  a  pity  because  IBCT  certainly  permits  realistic  models  and  there 
is  plenty  to  do;  the  investigation  of  average  case  complexity  of  approximately  solved  prob¬ 
lems  is  in  its  infancy.  It  would  take  only  a  few  illuminating  results  concerning  some  rea¬ 
sonable  models  to  restore  faith  in  the  program  launched  by  the  General  Theory  of  Optimal 
Algorithms.  Even  then  we  doubt  that  all  the  notation  (see  Section  4.2)  is  really  necessary. 
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2.  High  Level  Criticisms 

A)  This  is  not  Complexity  Theory 

Numerical  Analysis  and  Complexity  Theory  are  palpably  different  subjects.  Complex¬ 
ity  Theory  (CT  hereafter)  seeks  to  determine  the  intrinsic  difficulty  of  certain  tasks, 
whereas  much  of  theoretical  Numerical  Analysis  (NA  hereafter)  has  been  concerned  with 
studying  classes  of  algorithms,  analyzing  convergence  and  stability,  developing  error 
bounds  (either  a  priori  or  a  posteriori),  and  detecting  either  optimal  or  desirable  members 
of  a  class  according  to  various  criteria.  Clearly  CT  has  more  ambitious  goals  than  does 
NA. 

One  major  theme  of  IBCT  is  to  find  the  minimal  cost  of  achieving  a  certain  level  of 
approximation  for  the  hardest  case  in  a  given  problem  class  F  using  any  algorithm  that 
confines  itself  to  certain  partial  information  about  the  case.  One  of  the  papers  we  examine 
is  concerned  with  the  matrix  eigenvalue  problem  and  the  other  with  the  solution  of  large 
systems  of  linear  equations.  See  [Ku,1986]  and  lTr&Wo,1984].  Now  we  can  formulate 
our  first  complaint,  one  that  applies  to  the  results  in  both  papers. 

The  theorems  say  nothing  about  the  intrinsic  cost  of  computing  an  approximate 
solution  to  either  of  the  problems  mentioned  above  because  the  specified  infor¬ 
mation  is  not  naturally  associated  with  the  task  but  is  acquired  when  a  certain 
class  of  numerical  methods  is  employed. 

The  class  is  sometimes  called  the  Krylov  subspace  methods;  one  is  not  given  a 
matrix  A  explicitly,  but  instead  a  few  vectors  b,  Ab,  A^b,  A^b,  ...,  A^b,  and  one  wishes 
to  approximate  the  solution  of  a  system  of  linear  equations  Ax=^  or  some  specified  eigen- 
pair  of  A.  More  details  are  given  in  Section  3.2  and  3.3.  So  the  invitation  to  minimize 
cost  over  all  algorithms  subject  to  the  given  information  turns  out,  in  these  cases,  to 
amount  to  the  quest  for  the  best  that  can  be  achieved  at  each  step  of  a  Krylov  subspace 
method.  This  is  exactly  the  sort  of  work  that  numerical  analysts  do. 

We  do  not  wish  to  belabor  the  obvious  but  our  suggestion  that,  in  these  cases,  IBCT 
has  the  appearance  of  CT  without  the  substance  is  important  for  the  following  reason.  It 
might  be  claimed  that  we  interpret  IBCT  results  as  though  they  were  results  about  Krylov 
subspace  methods  (i.e.  NA)  when,  in  fact,  they  are  CT  results  concerning  Krylov  informa¬ 
tion.  In  other  words,  perhaps  we  are  guilty  of  looking  at  the  world  through  NA  spectacles 
and  missing  that  subtle  difference  of  emphasis  characteristic  of  CT.  This  possibility  needs 
to  be  considered  but  the  stubborn  fact  remains  that  restricting  information  to  Krylov  infor¬ 
mation  is  not  part  of  the  linear  equations  problem  nor  of  the  eigenvalue  problem. 

B)  Free  Information  in  the  Problem  Class 

The  ingredient  of  IBCT  that  allows  it  to  generate  irrelevant  results  is  the  problem 
class  F  [see  para.  2  in  (A)].  F  did  not  appear  in  our  brief  description  of  the  theory  in  the 
third  paragraph  of  Section  1  because  it  is  not  a  logically  essential  ingredient  but  rather  a 
parameter  within  IBCT.  Let  us  describe  the  role  it  plays.  There  is  a  task  T  to  be  accom¬ 
plished,  there  is  an  information  sequence  N=(N^,N2,N2,...)  coupled  with  a  measure  of  cost 
(Nj  costs  j  units),  and  there  is  F.  For  each  Nj  the  only  algorithms  admitted  for  con¬ 
sideration  are  those  that  restrict  themselves  to  use  Nj  and  standard  auxiliary  computations. 
For  a  worst-case  complexity  analysis  the  main  technical  goal  is  to  determine  the  minimal 
cost,  over  admissible  algorithms,  required  to  achieve  T  for  the  most  difficult  problem 
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within  F  that  is  consistent  with  N.  This  minimal  cost  may  be  called  CiF,N,T). 

Suppose  now  that  F1C.F2  and  that  C(Fi,N,T)<C{F2,N,T).  Such  a  result  is  of  little 
relevance  to  the  achievement  of  T  unless  one  can  determine  that  a  problem  lies  within  F^ 
rather  than  F2.  To  put  the  matter  in  other  words  we  might  say  that  knowledge  of  member¬ 
ship  in  F  is  information  and  should  have  a  cost  attached  to  it.  Whenever  F  is  very  large 
(for  example,  the  class  of  continuous  functions  or  the  class  of  invertible  matrices)  then  it  is 
realistic  to  assign  no  cost  to  it.  On  the  other  hand  there  are  examples  (see  Section  4.4) 
where  it  may  be  as  expensive  to  ascertain  membership  in  F  as  to  achieve  T,  given  N,  over 
a  larger  class  of  problems.  In  such  cases  C(F,N,T)  bounds  one  pan  of  the  expense  while 
ignoring  the  other.  Let  C(N,T)  denote  C(F,N,T)  when  F  is  as  large  as  possible. 

We  may  reformulate  the  minimax  quantity  CiN,T)  with  the  aid  of  a  useful  piece  of 
notation.  To  each  Nj  there  is  a  set  Vj  of  problems  (matrices  in  our  case)  that  are  indistin¬ 
guishable  by  Nj.  The  sets  Vj,  j  =  l,2,...,/i  are  nested  and  eventually  reduce  to  a  singlet^'” 
Associated  with  any  approximation  z  is  the  set  Rj(z)  of  indistinguishable  residuals  (e.g. 
Rj(z)  =  b  -  A  z ,  AeVj  for  the  linear  equations  problem  A  x  =  b  ).  The  goal  is  to  find  the 
smallest  natural  number  k  such  that  there  is  a  z  for  which  R^iz)  lies  in  the  target  ball  (e.g. 
fi(0,e||hll),  the  ball  in  /?"  centered  at  the  origin  with  radius  £l|h|l).  This  is  C(N,T). 

This  formulation  reveals  several  things.  First,  the  admissible  algorithms  cited  in  the 
minimax  formulation  of  C(N,T)  are  not  really  needed,  what  matters  is  the  size  of  /?^(z) 
for  various  z.  Second,  one  reason  why  there  is  very  little  in  the  NA  literature  on  the  prob¬ 
lem  of  finding  the  minimal  k  is  that  for  most  interesting  tasks  k  =  n,  the  sets  R^iz)  are 
just  too  big,  and  so  the  problem  is  not  interesting. 

One  way  to  reduce  the  indistinguishable  sets  is  to  introduce  a  subclass  F  and  to  use 
Vj  =  F  in  place  of  Vj.  This  was  discussed  above.  For  approximation  theory  there  is 
no  objection  to  the  introduction  of  unknown  quantities  that  might  characterize  F.  How¬ 
ever,  as  mentioned  above,  IBCT  seems  to  use  F  as  a  tuning  parameter  designed  to  keep 
k  <  n. 

C)  A  Confusion  of  Worst  Cases 

An  important  feature  of  Krylov  information  [b,  Ab,  A^b,...]  [see  para.  3  in  (A)]  is 
the  so-called  starting  vector  b  which  is,  of  course,  quite  independent  of  the  goal  of  com¬ 
puting  eigenvalues.  There  are  two  different  factors  that  can  increase  the  cost  of  using  this 
information  to  approximate  eigenvalues  of  A.  One  is  an  intrinsic  difficulty;  some  matrices 
in  the  given  class  may  have  unfortunate  eigenvalue  distributions.  The  other  is  that  b  may 
be  poorly  chosen.  Instead  of  separating  the  effects  of  these  factors  the  eigenvalue  paper 
combines  them  and  so  ends  up  analyzing  Krylov  information  with  a  worst  possible  starting 
vector  even  though  satisfactory  starting  vectors  are  easy  to  obtain.  The  fact  that  b  is 
treated  as  prescribed  data  is  quite  difficult  to  spot.  This  situation  is  in  stark  contrast  to  the 
linear  equations  problem  where  b  is  part  of  the  problem  of  solving  Ax=b. 

The  study  of  worst  choices  for  b  is  not  without  interest.  See  [Sc,  1979],  for  example. 
Such  studies  are  relevant  to  the  inverse  eigenvalue  problem  but  not  to  the  complexity  of 
approximating  eigenvalues  via  Krylov  subspaces. 
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Section  5  discusses  the  issue  in  more  detail  but  the  conclusion  is  that  the  wrong  plac¬ 
ing  of  b  twists  the  model  away  from  its  original  target.  It  is  only  this  distorted  model  that 
permits  the  proof  of  the  surprising  results  in  [Ku,1986,Abstract]. 

D)  Spurious  Challenges 

The  optimality  properties  of  various  Krylov  subspace  methods  are  well  known,  see 
[Sti,1958].  IBCT’s  claim  to  have  something  new  to  add  is  based  on  the  suggestion  that  its 
theory  considers  any  algorithm  (confined  to  Krylov  information)  including  those  which 
give  approximations  z  from  outside  the  Krylov  subspace.  See  the  quotation  in  Section  4.1. 
The  trouble  with  this  apparent  novelty  is  that  it  is  not  possible  to  evaluate  the  residual 
norm  ||h-/4z||  for  these  external  z  because  there  is  no  known  matrix  A  (only  Krylov  infor¬ 
mation).  So  how  can  an  algorithm  that  produces  z  verify  whether  or  not  it  has  achieved  its 
goal  of  making  \\b-A2\\<e\\b\\l  Perhaps  that  is  why  no  such  new  algorithm  is  actually 
exhibited?  IBCT’s  suggestion  that  it  goes  beyond  the  well  known  polynomial  class  of 
algorithms  is  more  apparent  than  real. 

E)  A  New  Eigenvalue  Problem 

The  task  of  computing  some  or  all  the  eigenvalues  of  a  matrix  is  acknowledged  to  be 
of  practical  importance.  When  only  a  few  eigenvalues  of  a  large  order  matrix  are  wanted, 
one  seeks  either  the  smallest  eigenvalues,  or  the  largest,  or  all  in  a  given  region.  Unfor¬ 
tunately  [Ku,1986]  makes  a  subtle  change  in  the  problem.  The  redefined  goal  asks  for  any 
approximate  eigenpair  (value  X  and  vector  x)  without  reference  to  where  in  the  spectrum 
the  approximate  eigenvalue  may  lie.  Of  course  a  theorist  is  entitled  to  investigate  any 
problem  he  or  she  chooses.  However  we  have  yet  to  hear  of  any  use  for  such  output.  Our 
complaint  is  that  no  indication  is  given  that  the  goal  is  an  unusual  one.  Very  few  readers 
would  realize  that  the  familiar  relevant  eigenvalue  problem  has  been  bypassed.  Indeed  we 
missed  the  point  ourselves  until  a  friend  pointed  it  out. 

It  is  standard  practice  to  use  the  size  of  the  residual  norm  (||Ax-j:All)  as  the  means  by 
which  to  decide  whether  a  specified  approximate  eigenvalue  is  accurate  enough.  IBCT 
twists  things  around  and  makes  a  small  residual  norm  into  the  goal.  It  is  the  old  philosoph¬ 
ical  error  of  mistaking  the  means  for  the  end. 
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3.  Preliminaries 

3.1  A  Word  on  Matrix  Computations 

The  subject  begins  with  three  basic  tasks. 

i)  solve  systems  of  linear  algebraic  equations;  written  as  Ax  =  b  with  x ,  b  column 
vectors  and  A  a  matrix, 

ii)  compute  eigenvalues  and  eigenvectors  of  A , 

iii)  solve  least  squares  problems,  i.e.  minimize  II  b  -  Ac  II  over  all  x.  The  Euclidean 
norm  is  used  throughout  this  article. 

There  are  very  satisfactory  programs  for  accomplishing  these  tasks  when  the  matrices 
are  small.  An  nxn  matrix  is  said  to  be  small  if  a  couple  of  n  x  n  arrays  can  be  comfort¬ 
ably  stored  in  the  fast  memory  of  the  computer.  These  days  a  50  x  50  matrix  would  be 
considered  small  on  most  computer  systems.  The  reader  may  consult  [  Pa,  1984  ]  for  more 
information.  The  methods  in  use  make  explicit  transformations  on  the  given  matrix.  There 
are  one  or  two  open  problems  concerning  convergence  of  some  methods  but  by  and  large 
the  small  matrix  problem  is  in  satisfactory  condition  with  respect  to  conventional  one- 
calculation-at-a-time  (sequential)  computers. 

One  reason  for  introducing  this  slice  of  history  into  the  discussion  is  to  bring  out  the 
fact  that  computation  with  large  order  matrices  (say,  5000  x  5000)  is  a  somewhat  different 
game  from  computation  with  small  ones.  Sometimes  the  very  problem  itself  changes.  For 
solving  Ax  =  b,  the  goal  does  remain  the  same  but  often  the  product  Av,  for  any  vector  v, 
can  be  formed  cheaply  and  so  one  seeks  methods  that  exploit  this  feature  and  do  not  factor 
A .  For  the  eigenvalue  problem,  there  are  many  applications  where  only  a  few  eigenvalues 
are  wanted,  perhaps  30  out  of  5000,  and  it  is  desirable  to  avoid  changing  A  at  all.  Thus 
the  task  has  changed;  there  is  no  desire  to  diagonalize  A .  The  third  standard  task,  the  least 
squares  problem,  remains  the  same  and  there  is  no  preferred  approach;  sometimes  the  data 
are  transformed  but  with  strenuous  efforts  to  maintain  sparsity,  at  others  the  original  matrix 
is  not  altered. 

For  all  three  problems  it  often  happens  that  a  sequence  of  similar  cases  must  be 
treated  as  parameters  are  changed.  This  leads  to  updating  problem.  This  ends  our  histori¬ 
cal  digression. 

3.2  A  word  on  Information-based  Complexity 

We  describe  a  simple  version  of  IBCT  that  is  used  in  the  two  papers  to  be  examined. 
It  does  not  use  the  full  panoply  of  concepts  in  the  monograph  or  its  sequel 
[Tr,Wo&Wa,1983]. 

There  are  a  few  essential  ingredients  that  are  best  seen  in  a  specific  context. 

1)  a  class  F; 

e.  g.  F  =  {5  :  5  e  symmetric,  positive  definite} 

2)  a  task; 

e.g.  given  b  ^  0  in  F",  and  e>0,  find  x  in  F"  s.t.  II  Ax-b  II  <e  II  b  II,  A  e  F. 
Here  II  •  II  is  some  given  norm. 
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3)  information  N  =  (Nq, 

e.g.  NjiA,b)=  {b,  Ab,  A^b  }  for  natural  numbers  j<n  ,  A  e  F. 

4)  a  measure  of  cost; 

e.g.  j  units  for  Nj.  In  this  model  the  forming  of  linear  combinations  of  vectors  is 
free. 

Items  2  and  3  do  not  make  it  clear  that  A  is  not  known  explicitly.  There  is  more  discus¬ 
sion  of  this  point  in  Section  4.3. 

To  use  an  algorithm,  restricted  to  the  information  N ,  in  order  to  solve  a  problem  in  F 
will  entail  cost  that  may  vary  with  the  problem.  The  primary  goal  of  worst-case  IBCT  is 
to  minimize,  over  all  such  algorithms,  the  maximum,  over  all  problems  in  F,  of  that  cost. 
Determining  the  minimum,  even  roughly,  is  worth  while  even  if  an  algorithm  that  achieves 
the  minimum  is  not  known.  There  is  an  analogous  average-case  theory. 

This  certainly  has  an  appeal. 

Please  note  that,  in  our  example.  Item  3  puts  this  theory  firmly  within  Numerical 
Analysis.  This  is  because  the  information  in  this  example,  and  it  is  typical,  is  not  part  of 
the  task.  The  information  Nj  will  only  be  available  if  a  certain  type  of  method  is  invoked. 
Consequently  the  theory  is  not  addressing  the  intrinsic  cost,  or  difficulty,  of  solving  linear 
systems  but  is  confined  to  seeking  the  number  of  needed  steps  within  a  chosen  class  of 
methods.  This  is  what  numerical  analysts  do,  and  have  done,  from  the  beginning. 

In  the  1970’s  the  word  ‘complexity’  was  reserved  for  the  intrinsic  difficulty  of  a  task 
and  the  word  ‘cost’  was  used  in  connection  with  a  particular  algorithm.  For  example,  see 
[Bo&Mu,1975]  and  [Wi,1980].  However,  it  is  now  common  to  talk  about  the  complexity 
of  an  algorithm  as  a  synonym  for  cost.  This  extension  of  the  term  complexity  does  no  great 
harm.  What  is  misleading  is  that  the  notion  of  information  appears  to  be  independent  of 
any  algorithm.  This  allows  the  theory  to  talk  about  the  set  of  all  algorithms  that  confine 
themselves  to  the  given  information.  As  indicated  in  the  previous  paragraph  this  way  of 
talking  may  sometimes  be  a  reformulation  of  the  standard  practice  of  optimizing  over  a 
specified  family  of  methods. 

For  j<n  the  information  Nj{A,b)  is  partial;  there  are  many  matrices  that  are  indistin¬ 
guishable  from  A  in  the  sense  that  each  of  them  generates  the  same  set  of  y-t-1  vectors. 
The  basic  technical  concept  in  the  theory,  the  matrix  index  k{0,A)  of  an  algorithm  O,  is 
the  minimal  cost  using  <D  to  guarantee  achievement  of  the  task  for  all  matrices  in  F  that 
are  indistinguishable  from  A  .  There  is  more  discussion  in  Section  4.2  and  Section  2(B). 

For  the  eigenvalue  problem  the  task  is  stated  as:  find  .re  C"  and  p  e  C  such  that 

\\Ax-xp\\<e  for  all  A  in  F  that  are  indistinguishable  from  A.  The  defects  in  this 
definition  were  mentioned  in  Part  (E)  of  Section  2  above. 

The  theory  seeks  min  k(<i>,A)  and  other  related  quantities  while  O  is  restricted  to 
information  N .  This  minimum  is  the  complexity  of  the  task. 
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3J  Krylov  Subspaces 

Here  we  sketch  the  conventional  wisdom  on  this  topic.  These  subspaces  of  are 
defined  by 

(A  ,b)=span(b ,  Ab,  ... ,  A^~^b). 

There  is  no  loss  in  assuming  that  dim  K^=j.  Information  Nj  permits  computation  of  any 
vCcioi  in  not  just  ,  at  no  further  cost  provided  that  the  coefficients  7,  .n 

v=^  Yi  (A‘b)  are  known. 
i=0 

On  the  practical  side  the  dominant  question  for  the  experts  has  been  how  to  obtain  a 
computationally  satisfactory  basis  for  AT-'.  Round  off  error  destroys  the  expected  linear 
independence  of  the  computed  vectors.  Some  researches  maintain  that  it  is  mo. efficient 
to  use  a  highly  redundant  spanning  set  rather  than  a  basis.  Others  recommend  the  addi¬ 
tional  expense  of  computing  an  orthonormal  basis.  In  any  case  it  is  the  computation  of  the 
basis,  or  spanning  set,  along  with  multiplication  of  vectors  by  A  that  is  the  expensive  part 
of  the  computation.  The  model  of  cost  used  in  IBCT  theory  reflects  this  quite  well.  It  is  the 
number  of  ‘steps’  that  matters.  We  think  of  a  step  as  adding  one  more  vector  to  the  Krylov 

sequence  {b,  Ab,  A  b,  ...}. 

One  of  the  founders  of  Krylov  space  method,  C.  Lanczos  [La,  1952],  proposed  a  useful 
basis  for  with  the  property  that  the  projection  of  symmetric  A  onto  is  a  symmetric 
tridiagonal  jxj  matrix  Tj.  Tridiagonal  matrices  are  easy  to  handle.  With  a  basis  in  hand 
there  are  diverse  tasks  that  can  be  accomplished.  Here  is  a  (partial)  list. 

i)  Compute  an  approximation  to  A~^b  such  that  e  AT-'  and  its  residual 
b-Ax^^^  is  orthogonal  to  ATA  It  so  happens  that  II  b-Ax^^^  II  may  be  computed 
without  forming  so  there  is  no  need  to  compute  unsatisfactory  x^^\  When 

A  e  SPD  (sym. ,  pos.  def.  matrices)  then  x^^^  coincides  with  the  output  of  the 
conjugate  gradient  algorithm. 

ii)  Compute  the  vector  that  minimizes  II  b-Av  II  over  all  v  e  AT-'  (not  AT-'^^). 
This  is  the  MR  (minimum  residual)  algorithm.  The  extra  vector  A^ b  is  needed  to 
ascertain  the  coefficients  in  the  expansion  of 

iii)  Compute  some,  or  all,  of  the  Rayleigh-Ritz  approximations  (0,,  y,  ,  i  =  1,  ,  j 

to  eigenpairs  of  symmetric  A.  Here  0,  e  R  and  (y,  ,  ...  ,  Vj)  is  an  orthonormal 
basis  for  .  For  each  i.  Ay-,  -  y,  0,  is  orthogonal  to  kL 

Krylov  subspace  methods  are  not  really  iterative.  All  the  basic  tasks  mentioned  in 
Section  3.1  are  sol  cd  exactly  (in  exact  arithmetic)  in  at  most  n  steps.  However  the  interest 
in  this  approach  is  due  to  the  fact  that  in  many  instances  far  fewer  than  n  steps  are  required 
to  produce  acceptable  approximations.  In  other  words  to  take  n  steps  is  the  practical 
equivalent  of  failure.  However  for  each  basic  task  there  are  data  pairs  (A.  b)  which  do 

require  n  steps  even  for  loose  tolerances  such  as  e  =  10“^.  So  research  has  focussed  on 
explanations  of  why,  so  often,  the  cost  is  much  smaller.  The  gradual  realization  of  the 
efficacy  of  changing  the  Ax  =  h  problem  to  an  equivalent  one  via  a  technique  called 


-  9  - 


preconditioning  has  enhanced  the  use  of  Krylov  subspace  methods. 

In  a  sense  the  ‘  convergence  ’  of  all  these  methods  is  completely  understood,  in  the 
symmetric  case,  and  is  embodied  in  the  error  bounds  published  by  S.  Kaniel  and  improved 
by  C.  C.  Paige  and  Y.  Saad.  See  [Ka,1966],  [Sa,1980],  and  [Pa,1980]  for  the  details.  The 
error  depends  on  two  things;  the  eigenvalue  distribution  of  A  and  the  components  of  the 
staning  vector  b  along  the  eigenvectors.  Of  course,  all  this  analysis  supposes  a  given 
matrix  A ,  not  a  set  of  indistinguishable  matrices. 

From  this  conventional  viewpoint  the  thrust  of  these  two  complexity  papers  is  to  see 
to  what  extent  the  standard  algorithms  (CG,  MR,  Lanczos)  do  not  make  best  use  of  the 

information  on  hand.  Recall  that  Nj  contains  an  extra  vector  not  in  A-'.  This  is  a  reasonable 
project  and  the  results  can  be  expressed  within  the  usual  framework.  The  term  ‘complexity 
theory’  appears  to  a  numerical  analyst  like  window  dressing. 
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4.  On  the  Optimal  Solution  of  Large  Linear  Systems 

These  sections  offer  a  description  of  and  commentary  on  [Tr&Wo,l984]. 

4.1  Spurious  Generality 

Here  is  a  quotation  from  the  introduction: 

‘We  contrast  our  approach  with  that  which  is  typical  in  the  approximate  solution 
of  large  linear  systems.  One  constructs  an  algorithm  <[>  that  generates  a  sequence 
approximating  the  solution  a  =  A~^b-,  the  calculation  of  requires  k 
matrix-vector  multiplication  and  x^,  lies  in  the  Krylov  subspace  spanned  by 
b,  Ab,  ...  ,  A'^b.  The  algorithm  4)  is  often  chosen  to  guarantee  good  approxima¬ 
tion  properties  of  the  sequence  In  some  cases  O  is  defined  to  minimize 

some  measure  of  the  error  in  a  restrictive  class  of  algorithms.  For  instance,  let 
this  class  be  defined  as  the  class  of  ‘polynomial’  algorithms;  that  is 

a-JCfc  =  Wi^(A)a,  where  W^(0)  =  1. 

Here  is  a  polynomial  of  degree  at  most  k. 


<Some  omitted  sentences  define  the  minimum  residual  and  conjugate  gradient 
algorithms.> 

It  seems  to  us  that  this  procedure  is  unnecessarily  restrictive.  It  is  not  clear,  a 
priori,  why  an  algorithm  has  to  construct  x^  of  the  form  a-Xj^  =  W,^{A)a. 

Indeed,  we  show  that  for  orthogonally  invariant  classes  of  matrices  the  minimum 
residual  algorithm  (MR)  is  within  at  most  one  matrix  vector  multiplication  of  the 
lower  bound  without  any  restriction  on  the  class  of  algorithms.  However,  if  the 
class  is  not  orthogonally  invariant,  the  optimality  of  MR  may  disappear.’ 

Our  first  point  was  made  earlier.  The  information  N  does  not  come  with  the  linear 
equations  problem.  The  brief  answer  to  the  quoted  rhetorical  question  (why  must  an  algo¬ 
rithm  construct  Xj^  of  the  given  form?)  that  serves  to  justify  the  whole  paper  is  the  follow¬ 
ing.  To  any  vector  x  NOT  in  the  Krylov  subspace  K*  there  is  an  admissible  matrix  A 
such  that  the  residual  norm  is  as  large  as  you  please.  This  holds  even  when  <4  is  required 
to  be  symmetric  and  positive  definite.  An  admissible  matrix  A  is  one  that  is  consistent 
with  the  Krylov  information.  More  on  this  below. 

4.2  Definitions  and  Optimality 

In  this  section  we  put  down  the  definitions  made  in  [Tr&Wo,1984].  Our  comments 
are  reserved  for  the  next  section. 

i)  Let  F  be  a  subclass  of  the  class  GL{n,R)  of  «x/j  nonsingular  real  matrices. 

ii)  Let  b  e  R''  with  W  b  W  =  {b,b)^'^  =  1  be  given. 

For  0  <  £  <  1,  find  x  e  R''  such  that 

l(  b-Ax  II  <  £,  A  e  F, 
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iii) 

iv) 
V) 


vi) 


vii) 


viii) 


ix) 


X) 


xi) 


xii) 


Krylov  information:  Nj(A,b)  =  [b,  Ab,  ,  A^b),  ;'  =  0,1,  ... 

Measure  of  cost:  Nj  costs  j  units. 

An  algorithm  0=  {  0^  }  is  a  sequence  of  mappings 

0-  NjiFM’') 


The  set  of  indistinguishable  matrices  for  given  Nj{A,b): 

V{Nj(A,b))  =  {A:AeF:  Nj(A,  b)  =  NjiA,  b)}. 
The  matrix  index  of  an  algorithm  <I): 


where 


/:(<!>,  =  min  {  j:  max  I16-Ajc,|l  <  e}, 

AeV(Nj) 


Nj  =  NjiA,b), 


Xj  =  <J>jiNjiA,b)). 

If  the  set  of  j  values  is  empty  then  ki^jiNj(A,b)  =  o®. 

The  class  index  of  an  algorithm  O: 

F)  =  max  k((i>,  B). 
BeF 


The  optimal  matrix  index: 

k(A)  =  min  k(^.  A)  over  O  restricted  to  N. 
<D 


The  optimal  class  index: 


k(F)  =  max  k{B). 
BgF 


Strong  optimality:  O  is  strongly  optimal  iff 

it(<I),  B)  =  k(A),  for  each  B  g  F. 


Optimality:  O  is  optimal  iff 

F)  =  k(F). 


xiii)  Almost  strong  optimality:  <!>  is  almost  strongly  optimal  iff 

k(^,B)  <  k(B)  +  c,  for  every  B  g  F, 
for  some  small  integer  c. 

Remark  1.  Since  A'b  =  A  b)  it  follows  that  Krylov  information  Nj{A,b) 

requires  j  applications  of  the  operator  A .  That  is  why  the  cost  is  given  as  j  units.  In 
practice  one  uses  better  bases  for  the  Krylov  subspace  than  is  provided  by 
Nj(A,b)  but  for  theoretical  purposes  this  modification  may  be  ignored. 

Remark  2.  It  can  happen  that  ik(A)  <«:  k{F).  For  this  reason  it  is  of  interest  to  find 
algorithms  with  small  matrix  index. 
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Remark  3.  For  simplicity  the  dependence  of  ail  concepts  on  n,  Nj,  b,  and  e  is 
suppressed.  The  idea  is  to  compute  kiA)  and  k(F)  for  interesting  classes  F  and  to 
find  strongly  optimal  or  optimal  algorithms  if  possible. 

4.3  Discussion  of  the  basic  concepts 

In  Section  2  we  pointed  out  how  misleading  it  can  be  to  compute  complexity  for  res¬ 
tricted  classes  F  that  are  difficult  to  discern  in  practice.  Here  we  wish  to  point  out  that  F 
is  introduced  into  the  basic  definitions,  such  as  V  in  (vi),  and  there  is  no  need  for  it. 

To  add  to  any  confusion  the  basic  definitions  do  not  make  clear  the  role  of  /I .  In  ihe 
context  of  numerical  analysis  there  is  a  particular  matrix  A  on  hand  and  this  permits  one  to 
test  the  residual  r  =  h  -  Av  for  any  vector  v.  However  in  the  context  of  IBCT  that  is  not 
quite  the  case.  In  this  game  we  consider  a  specific  A  but  it  is  not  available  explicitly.  That 
odd  situation  certainly  warrants  some  discussion  and  it  faithfully  reflects  the  state  of  affairs 
at  a  typical  step  in  a  Krylov  subspace  method.  The  matrix  is  hidden  inside  a  subprogram 
and  the  user  has  only  a  basis  for  the  Krylov  subspace  corresponding  to  Krylov  information 
Nj(A,b)  =  {b,  Ab ,...,A^ b} .  Associated  with  Nj(A,b)  is 

V(Nj(A,b))  =  {  A:Nj(A,  b)  =  Nj{A,  b)], 

the  set  of  matrices  indistinguishable  from  A  by  Nj(A,b).  Contrast  V  with  V  in  Section  4.2 
(vi). 

With  V  defined  above  the  natural  definition  of  the  matrix  index  of  an  algorithm  O  is 
k{^.  A)  =  min  {  j:  max  ||h-AXj||  <  c},  N:  =  NjiA,b), 

A€V(Nj)  ^ 

where 


=  O/iV/A,^;)). 

If  the  set  of  ]  values  is  empty  then  ^(<I>,A)  =  ©o.  Please  note  that  in  contrast  to  (vii)  in 
Section  4.2  there  are  no  hidden  parameters  in  k.  It  is  the  first  step  in  the  process  at  which 
the  task  is  accomplished  by  d>  for  all  matrices  indistinguishable  from  A  by  Aj  through  . 
Then  the  optimal  index  is 

k(A)  =  min  k(0.  A) 

over  all  O  such  that  Oy  uses  only  NjiA,b)  and  standard  arithmetic  operations. 

There  is  no  logical  need  for  F .  However  given  a  class  F  one  may  define 

k{0,F)  =  max  ^(d>,  fi);  k{F)  =  min  ki<b,F). 

BeF  O 

Why  did  IBCT  not  follow  this  simple  approach?  Why  does  IBCT  use  V  =  V  Pi  F  to 
define  the  matrix  index  k{<!>,  A)  and  thus  suppress  the  role  of  F?  The  reason,  we  suspect, 
is  that  with  these  natural  definitions  the  ’polynomial’  algorithms,  deemed  to  be  restrictive 
in  the  introduction  to  [Tr&Wo,1984]  are  mandatory  and  consequently  IBCT  has  nothing 
new  to  offer.  Here  is  the  key  observation. 
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THEOREM: 

Assume  A  =  A‘  e  Let  AT-'  =  span(b,Ab,...,A^~^b)  have  dimension  j  (<n).  To  each 

there  exists  A  e  V{Nj(A,b))  such  that  ||d  -  Av||  >  1. 


Sketch  of  proof. 

1.  Should  b  be  orthogonal  to  some  eigenvectors  of  A  it  is  always  possible  to  choose  an 
A  eV(Nj(A,b))  such  that  b  is  not  orthogonal  to  any  of  A ’s  eigenvectors.  If  necessary 
replace  A  by  A . 

2.  There  is  a  distinguished  orthonormal  basis  for  AT-'  that  can  be  extended  to  a  basis  for 
R  "  and  in  which  A  is  represented  by  the  matrix 

V  Q  ' 

U 

where  T  =  T‘  e  ,  □  is  entry  (y'+lj)  and  0',y+l),  a  =  P  ^0.  Moreover 

T  and  /3  are  determine  by  Nj(A,b). 

3.  In  this  distinguished  basis  b  is  represented  by  ej  =  (1,0,0,...,0)'.  Let  v  be  represented 

f  where  /  €  /?■',  g  e  R'*~K  By  hypothesis  g  ^  0,  since  and 

II  ft  -  Av  II  =  II  «!  -/  -  Cj  P  g(l)  Ip  +  Iki  Pfij)  +  Ug\\\ 
where  U  is  the  (2,2)  block  in  the  representation  of  A  . 

4.  To  each  U  €  there  is  an  A  e  V{Nj{A,b))  and  thus,  for  any  g  0,  there 

exists  U  such  that 

II  e,PfU)  +  Ug\\^>\. 

In  panicular  it  is  possible  to  select  f/  to  be  symmetric,  positive  definite. 

Here  ends  the  sketch  of  the  proof. 

Symmetry  is  not  needed  in  the  result  given  above.  If  A  is  not  symmetric  there  is  still 
a  distinguished  orthonormal  basis  for  Ar'(A,ft)  and  /?"  such  that  ft  is  represented  by 
and  A  is  represented  by 

H  y 

°  L 

Now  H,  a  =  p  ^0,  and  col  1  of  7  are  determined  by  Nj{A,b).  Moreover  Je^  ^  0  and  for 
A  indistinguishable  from  A  we  have 

II  ft  -  A  V  Ip  =  \\e,-Hf-Jgf  +  II  ej  pfU)  +  Lg  |p. 

This  can  be  made  as  large  as  desired.  In  the  language  of  IBCT  ic{<t>,A)  =  for  any  O 
such  that  0(A/^(A ,ft))  takes  values  outside  K^{A,b). 
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Only  two  choices  were  left.  Either  turn  away  to  unsolved  problems  or  cut  down  the 
number  of  indistinguishable  matrices  by  using 

V  =  vnF 

instead  of  V. 

Here  is  the  quandary  for  IBCT.  If  F  is  chosen  too  small  the  model  loses  realism.  If  F 
is  allowed  to  be  large  then  the  standard  ’polynomial’  regime  is  optimal. 

4.4  Discussion  of  Results 

The  main  result  of  the  paper  concerns  the  minimal  residual  algorithm  MR:  this  poly¬ 
nomial  algorithm’s  output  for  the  information  Nj  =  [b,  Ab,  ,  A^b}  is  the  vector  in  the 
Krylov  subspace  =  span(b,  Ab,  ... ,  A^~^b)  that  minimizes  the  residual  norm,  so  MR  is 

optimal  in  KK  Given  Nj  MR  needs  A:(MR,.4)  steps  to  achieve  an  e  -  approximation  in  the 
worst  case.  However  IBCT  says 

‘Theorem  3.1 

If  F  is  orthogonally  invariant  then 

k(MR,  A)  >  k(A)  >  jk(MR,  A)-l.  for  any  AeF. 

Furthermore  both  the  upper  and  lower  bounds  can  be  achieved.’ 


Recall  that  k{A)  is  the  minimal  number  of  steps  over  all  admissible  algorithms. 

The  fact  that  MR  is  NOT  always  strongly  optimal  for  the  given  information  appears 
to  give  substance  to  the  theory.  It  will  astonish  the  numerical  analyst,  so  let  us  look  at  the 
example  that  purports  to  show  that  MR  is  not  always  optimal  for  the  given  information; 
Example  3.2  in  the  paper.  This  class  is 

Fp=  [A:  A  =  B‘,  ||B1|  ^  p  <1  }. 

When  e,p  and  n  are  specially  related  so  that 


/n(((l-Kl-e^)^^^)/e) 
[inm+d-p^f^Vp)  ^ 


<  n 


then  MR  is  just  beaten  by  another  polynomial  algorithm,  called  the  Chebyshev  algorithm, 
because 


/:(Cheb,  4)  =  q(.e),  k(MR,A)  =  q(e)  +  1. 


A  word  of  explanation  is  in  order.  Recall  that  A^b  e  Nj(A,b)  but  A^b  ^  kK  The 
MR  algorithm  needs  A^b  to  compute  the  coefficients  y,  in 

MR(N/4,b))  =  ^  y.  iA‘b). 

1=0 

This  always  beats  the  Cheb  output  from  K^.  However  Cheb  can  use  the  well-known 
three-term  recurrence,  based  on  p,  to  obtain  its  equi-oscillation  approximation  from  AT 
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not  just  kL  With  the  right  relation  between  e,  p,  and  n  one  has 

\\b  -  A  MR(iV,(,))  II  >  II  b  -  A  Cheb(iV^(^))  ||  >  I|  b  -  A  ||. 

Is  it  fair  to  compare  them?  The  theory  claims  to  compare  algorithms  restricted  solely 
to  information  Nj.  So  how  could  the  Cheb  algorithm  obtain  the  crucial  parameter  p?  The 

answer  is  that  p  is  found  in  the  definition  of  the  problem  class  Fpl  In  other  words, 
knowledge  that  Cheb  can  use  is  passed  along  through  the  problem  class,  not  the  informa¬ 
tion. 

The  important  point  we  wish  to  make  is  not  just  that  comparisons  may  not  be  fair  but 
that  the  results  of  IBCT  tell  us  as  much  about  the  idiosyncrasies  of  its  framework  as  they 
do  about  the  difficulty  of  various  approximation  problems.  With  a  realistic  class  such  as 
SPD  (sym.  pos.  def.)  MR  is  optimal  (strongly)  as  it  was  designed  to  be,  and  as  is  well- 
known. 

In  more  recent  work  [Wo,1985,Tr&Wo,1988]  the  flaw  mentioned  above  appears  to  be 
corrected  and  the  parameter  p  is  put  into  the  information  explicitly.  Again  Cheb  wins  by  1 
because  it  uses  p  while  MR  does  not.  However  this  new  clarity  comes  at  the  expense  of 
realism;  The  Krylov  information  is  scrupulously  priced  while  p  comes  free.  Yet  member¬ 
ship  in  Fp  may  be  more  difficult  to  ascertain  than  the  approximate  solution. 

Although  the  IBCT  paper  does  not  mention  the  possibility  Krylov  information  may  be 
used  to  obtain  lower  bounds  on  p  that  get  increasingly  accurate  as  the  dimension  of  the 
Krylov  subspace  increases.  Algorithms  that  exploit  knowledge  of  the  spectrum  will  have 
good  average  behavior  but  there  is  little  room  for  improvement  in  the  worst  case. 

The  simple  facts  are  well  known:  Chebyshev  techniques  are  powerful  and  users  are 
willing  to  do  considerable  preliminary  work  to  estimate  parameters  such  as  p.  It  is  not 
clear,  and  depends  strongly  on  the  data,  when  it  is  preferable  to  use  a  weaker  method  such 
as  MR  that  does  not  need  extra  parameters.  The  result  that  MR  is  only  almost  strongly 
optimal  is  a  striking  example  of  obfuscation.  The  framework  of  IBCT  permits  unnatural 
comparison  of  algorithms. 

Embracing  the  Conjugate  Gradient  Algorithm. 

In  Section  4  of  their  paper  the  authors  generalize  the  framework  to  cover  other  known 
methods  such  as  the  conjugate  gradient  algorithm.  All  that  is  necessary  is  to  introduce  a 

parameter  p  into  the  basic  task.  Now  an  £-approximation  to  A~^b  is  redefined  as  any 
X  €  R''  that  satisfies 


\\  AP(x-A-^b)\\<e\\AP-^b  ||. 

The  cases  p=0,  1/2,  1  are  the  most  important,  and  when  p  is  not  an  integer  it  is  appropriate 
to  restrict  attention  to  the  symmetric,  positive  definite  subset  SPD  of  When  p  =  l 

we  recover  MR.  To  generalize  MR  to  p=0  it  is  necessary  to  use  the  normal  equations  of  a 
given  system.  The  new  feature,  slipped  in  without  mention,  is  that  with  p  <  1  the  right 
hand  side  of  the  definition  is  NOT  DIRECTLY  COMPUTABLE.  So  how  does  an  algo¬ 
rithm  know  when  to  terminate?  Please  note  that  Approximation  Theory  can  present  results 
that  are  not  computable  without  a  blush.  Approximation  Theory  merely  exhibits 


-  16  - 


relationships.  What  is  the  virtue  of  attaining  the  desired  accuracy  if  it  cannot  be  detected? 
The  fact  that  the  Conjugate  Gradient  algorithm  (p  =  1/2)  minimizes  ||b-Ajr||^  at  each  step 
is  well  known,  see  [Da,  1967].  However,  in  practice,  the  algorithm  is  usually  terminated 
when  ||b-Ax||<£l|b||  because  the  desirable  A  norm  is  not  available,  see  [Go&Va,1984]. 

Section  5  of  [Tr&Wo,1984]  considers  open  problems.  In  particular  the  authors  want  to 
go  beyond  Krylov  information  if  only  for  the  sake  of  settling  their  conjecture  that  Krylov 
information  is  optimal  in  the  class  of  information  operators  of  the  form 

Nj{A,b)  =  [b,  Azq,  ...  ,  Azj],  where  Zj  is  determined  from  A/j.i. 

4.5  An  Interesting  Result 

Recall  that 

Nj{A,  b)  =  {b,  Ab,  ... ,  AJb) 

=  span  [b,  Ab,  ... ,  A^b) 

ViNjiA,  b))=  [  A:  NjiA,  b)=Nj{A,  b)). 

Theorem  [Tr«&Wo,1984]  (reformulated  by  us) 

If  y  G  R"  yields  an  e  residual  norm,  H  b  -  Ay  1|  <  e  l|b||,  for  all  A  g  V(Nj(A,  b)) 
then  so  does  its  orthogonal  projection  z  onto 

We  offer  a  simplified  version  of  the  argument  in  [Tr&Wo,1984]. 

Proof:  There  are  5  steps. 

i)  Either  y  g  and  there  is  nothing  more  to  prove  or  y=z+w,  with 

z  €  and  O^w  onhogonal  to 

ii)  For  the  vector  w  defined  in  i)  there  is  a  unique  symmetric  orthogonal  matrix 
H  =  Hiw),  called  the  reflector  that  reverses  w.  In  particular 

a)  Hx  =  X,  for  x  orthogonal  to  w,  b)  Hw  =  -w. 

Define  an  auxiliary  matrix  A  by 

iii)  A  =  HAH  G  V{NjiA,b))  since,  by  use  of  a), 

A‘b=HA‘Hb  =  HA'b  =  A‘b,  i  =  I,  ...  ,  j. 

Note  that 

Ay-b  =  HAHy-b 

=  HA(z-w)-b ,  using  (j)  and  (ii){b). 

=  H(Az-b-Aw),  using  Hb=b. 

This  shows  the  crucial  relationship 

iv)  ||Ay-/)||  =  \\Az-b-Aw\\,  since  H  preserves  norms. 

Hence 
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v)  {|lAz-^?-i4w||+llAz-6+/4wl|},  by  triangle  inequality, 

=  by  (iv),  and  (i), 

l|i»||,  by  hypothesis. 

Recall  that  z  is  y’s  projection  onto  Hence  Az  =  Az  for  all  A  €  V(Nj{A,  b)) 

and  so 


l|Az-bii  =  ||Az-b|l 

<  e  ||^)||,  by  (v).  QED 

This  theorem  explains  why  MR  cannot  lag  more  than  one  step  behind  any  algorithm 
that  produces  an  e  residual  norm  for  all  matrices  indistinguishable  from  A.  For,  by 
definition,  MR  produces  from  Nj^^iA,  b)  (note  the  increased  subscript)  the  unique  vector 
in  that  gives  the  smallest  residual  and  so  is  at  least  as  good  as  the  vectors  y  and  z 
defined  in  the  proof  above.  But  y  could  be  the  output  of  a  rival  algorithm. 

Our  formulation  of  the  lemma  omits  any  mention  of  the  class  F .  Now  it  is  clear  why 
the  hypothesis  that  F  should  be  orthogonally  invariant  appears  in  most  of  the  theorems. 
Recall  that  V(NjiA,b))  is  ‘too  big’.  To  cut  down  the  number  of  ‘indistinguishable’ 
matrices  the  theory  uses  VPi  F  =  V.  To  make  use  of  the  theorem  it  is  necessary  to  have 
HAH  e  F  and  this  will  be  so  provided  that  F  is  orthogonally  invariant. 

4.5  Summary 

The  hidden  defect  in  the  framework  for  discussing  the  MR  algorithm  is  the  far  reach¬ 
ing  feature  that  allows  the  family  F  to  convey  what  most  people  would  call  free  informa¬ 
tion  behind  the  back  of  the  information  operator  N. 

More  disturbing  than  the  previous  defect  is  that  we  cannot  see  how  any  algorithm 
other  than  the  well  studied  polynomial  ones  could  know  when  it  had  achieved  an  e- 
approximation  if  it  is  restricted  to  the  given  information.  This  gives  rise  to  a  feeling  that 
[Tr&Wo,1984]  managed  to  create  an  artificial  problem  where  no  real  puzzle  exists.  The 
quoted  theorems  (3.1  and  4.2)  reflect  only  the  propensity  of  their  General  Theory  of 
Optimal  Algorithms  for  creating  such  situations. 
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5.  Optimal  Solution  of  Large  Eigenpair  Problems 

This  section  offers  a  description  of  and  commentary  on  [Ku,1986].  The  paper  demon¬ 
strates  cleverness  and  clean  exposition  but  nevertheless  suffers  from  design  flaws;  it 
equates  different  versions  of  a  given  algorithm  and  it  redefines  a  standard  task. 

From  the  abstract: 

‘The  problem  of  approximation  of  an  eigenpair  of  a  large  nx/t  matrix  A  is  con¬ 
sidered.  We  study  algorithms  which  approximate  an  eigenpair  of  A  using  the 

panial  information  on  A  given  by  b ,  Ab ,  A^b ,  b ,  j  <  n,  i.e.  by  Krylov 

subspaces.  A  new  algorithm  called  the  generalized  minimal  residual  algorithm 
(GMR)  is  analyzed.  Its  optimality  for  some  classes  of  matrices  is  proved.  We 
compare  the  GMR  algorithm  with  the  widely  used  Lanezos  algorithm  for  sym¬ 
metric  matrices.  The  GMR  and  Lanezos  algorithms  cost  essentially  the  same  per 
step  and  they  have  the  same  stability  characteristics.  Since  the  GMR  algorithm 
never  requires  more  steps  than  the  Lanezos  algorithm,  and  sometimes  uses  sub¬ 
stantially  fewer  steps,  the  GMR  algorithm  seems  preferable. 

....  The  Fortran  subroutine  is  also  available  via  ....  ’ 

This  last  phrase  shows  that  the  subject  matter  is  firmly  within  the  field  of  numerical 
analysis.  Implementation  issues  concerning  GMR  are  described  in  [Ku,1985]. 

5.1  A  Subtle  Change  of  Goal 
Here  are  the  first  five  lines. 

‘Suppose  we  wish  to  find  an  approximation  to  an  eigenpair  of  a  very  large  matrix 
A.  That  is,  we  wish  to  compute  (x,  p),  where  x  is  an  nxl  normalized  vector, 
11x11=1  ,  and  p  is  a  complex  number  s.  t. 

||Ax-xp||  <  e  (1.1) 

for  a  given  positive  e.  Here  II  .  II  denotes  the  2-norm.’ 

It  is  all  too  easy  to  assent  to  this  statement  of  the  problem  and  pass  on  to  the  rest  of 
the  anicle.  However  it  is  NOT  the  normal  eigenvalue  problem.  We  are  not  aware  of  any 
demand  at  all  for  the  accomplishment  of  this  particular  task.  The  users  of  eigenvalue  pro¬ 
grams  (engineers,  theoretical  chemists,  theoretical  physicists)  want  eigenvalues  in  specified 
parts  of  the  spectrum;  occasionally  they  want  the  whole  spectrum.  The  main  concern  of 
this  article  is  with  symmetric  matrices;  and  because  their  eigenvalues  are  real  the  usual 
demands  are  for  the  leftmost  p  eigenvalues  (for  some  p<n  )  or  the  rightmost  p  eigenvalues 
or  for  all  eigenvalues  in  a  given  interval.  Eigenvectors  may  or  may  not  be  wanted.  There  is 
nothing  inherently  wrong  with  restricting  attention  to  the  rather  special  case  p=l  and  a 
few  articles  (not  cited  by  Kuezynski)  have  been  devoted  to  it.  See  [0’L,Ste&Va,1979]  and 
[Pa,Si&Str,1982]  for  the  details. 

To  support  our  description  of  user’s  demands  we  refer  to  three  publications  from 
different  fields,  [Cu&Wi,1985,Introduction’,  [Je,1977,Ch.7],  [Sh,1977,Sect.6]. 

The  consequences  of  leaving  out  a  vital  aspect  of  the  usual  task  are  most  serious  pre¬ 
cisely  when  one  seeks  optimal  performance. 

One  reason  why  it  is  so  easy  to  overlook  the  omission  in  the  problem  statement  is 
that,  for  symmetric  matrices,  almost  everyone  does  use  the  residual  norm  II  Ax-xp  II  to 
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judge  the  accuracy  of  an  approximate  eigenpair  (x,  p).  However  it  is  not  very  interesting 
to  minimize  the  residual  norm  if  that  might  yield  a  p  in  the  unwanted  part  of  the  spectrum. 
Now  a  pure  mathematician  is  free  to  define  his  goal  at  will.  What  is  regrettable  is  that  no 
hint  is  given  to  the  reader  that  the  goal  is  not  standard. 

We  say  more  about  the  e  appearing  in  (1.1)  in  Section  5.5. 

We  mention  one  other  fact  which  may  be  news  to  readers  who  are  not  much  con¬ 
cerned  with  eigenvalue  problems.  It  suggests  why  the  direction  taken  by  Kuczynski  has  not 
appeared  in  the  literature  before.  If  we  are  given  a  symmetric  matrix  A  and  seek  a  single 
eigenvalue  (with  or  without  its  eigenspace)  then  the  wanted  eigenvalue  is  almost  certain  to 
be  either  the  leftmost  or  the  rightmost  one.  Recall  that  the  Rayleigh  quotient  of  a  column 

vector  jc  5^  0  in  /? "  is  the  number  Axlx‘x.  The  extreme  eigenvalues  of  A  are  the 
extreme  values  of  the  Rayleigh  quotient  over  all  possible  vectors  x  in  /?  ".  It  happens  that, 
for  the  given  Krylov  information  Nj,  the  Lanczos  algorithm  is  optimal  for  this  task  in  the 
strong  sense  that  it  yields  the  leftmost  and  rightmost  values  of  the  Rayleigh  quotient  over 
all  vectors  in  the  ‘available’  space  .  The  last  vector  A-'b  in  Ny  is  needed  to  ascertain 
the  extreme  values  in  .  Thus  the  problem  is  settled.  It  is  a  pity  that  this  well  known  fact 
was  not  mentioned. 

5.2  Choosing  a  Bad  Starting  Vector 

The  particular  aspect  of  Information-based  Complexity  Theory  adopted  in  the  paper 
under  review  is  called  worst-case  complexity.  It  seeks  bounds  on  the  cost  of  computing  e- 
approximations  over  all  matrices  in  certain  classes  F  and  over  ALL  starting  directions  b. 
Theorems  3.1,  3.2,  4.1,  5.1  (there  is  no  theorem  1.1  or  2.1)  in  [Ku,l986]  are  examples.  In 
particular  the  theory  must  cover  what  can  happen  with  the  worst  possible  starting  vector. 
Theorem  3.1  is  quoted  in  full  in  Section  5.4. 

There  is  nothing  wrong  with  studying  the  worst  case.  Indeed  it  has  already  been  done. 
[Sc,  1979]  is  a  paper  with  the  clear  title  ‘  How  to  make  the  Lanczos  algorithm  converge 
slowly’.  The  author  gives  formulae  for  a  starting  vector  that  prevents  any  Rayleigh  Ritz 
approximation  from  converging  until  the  final  step!  Scott’s  paper  appeared  in  Mathematics 
of  Computation,  the  principal  outlet  for  Numerical  Analysis  of  the  American  Mathematical 
Society,  but  it  is  not  referenced  in  [Ku,1986].  The  fact  that  SOME  Krylov  subspaces  can 
be  very  badly  aligned  with  A ’s  eigenvectors  does  prevent  worst-case  analysis  from  shed¬ 
ding  much  light  on  how  Krylov  subspaces  approach  certain  eigenvectors  in  the  usual  case 
of  a  random  starting  vector.  That  study,  of  course,  comes  under  average-case  analysis  and 
is  ripe  for  attention. 

Please  note  that  this  comment  is  quite  independent  of  comparisons  of  GMR  and  Lanc¬ 
zos.  The  points  is  this.  The  starting  vector  b  is  a  free  parameter  in  the  eigenvalue  prob¬ 
lem  (in  contrast  to  the  linear  equations  problem  Ax=b).  It  is  not  given  and  may  be  chosen 
to  improve  performance.  In  the  absence  of  extra  information  it  is  the  almost  universal 
habit  to  pick  b  with  the  aid  of  a  random  number  generator.  Recent  theoretical  work  on 
Lanczos  has  been  concerned  to  explain  why  this  choice  is  so  powerful.  See  [Sa,1980]  and 
[Pa,  1980].  Note  that  two  quite  different  situations  have  been  pushed  together  under  the 
label  ‘worst  case’.  It  is  quite  normal  to  consider  the  most  difficult  matrices  A  because  they 
are  part  of  the  problem.  On  the  other  hand  a  bad  b  is  a  self-inflicted  handicap  rather  than 
a  genuine  difficulty.  It  is  the  confounding  of  these  cases  that  is  unfortunate,  not  their 
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study. 

Returning  to  the  eigenvalue  problem,  we  can  rephrase  our  complaint  as  follows. 
Kuczynski’s  focus,  perhaps  unwittingly,  is  on  KryIov-subspaces>with*worst-possible- 
starting-vectors.  What  a  pity  that  this  was  not  emphasized!  The  numerical  examples 
given  in  the  paper  are  not  without  interest.  The  starting  vector  there,  though  not  perhaps 
worst  possible,  is  very  bad.  Both  methods,  GMR  and  Lanczos  converge  very  slowly.  The 
chosen  matrices  are  extensions  of  the  one  used  by  Scott  to  illustrate  how  bad  Rayleigh  Ritz 
approximations  can  be.  See  [Pa,1980,p.218]. 

We  ran  these  examples  with  our  local  Lanczos  program.  It  uses  a  random  starting  vec¬ 
tor,  and  convergence  was  quite  satisfactory.  The  results  are  given  in  Section  5.5. 

There  is  a  different  context,  in  which  the  focus  on  worst  starting  values  is  much  more 
relevant.  The  GMR  algorithm  presented  by  Kuczynski  is  a  generalization  of  the  MR 

(minimum  residual)  algorithm  used  to  compute  approximations  to  A~^b.  There  one  seeks 
vectors  X  in  R"  s.t.  II  Ax-b  II  <  e  II  h  II  .  A  well  chosen  subspace  may  be  used  to  gen¬ 
erate  approximate  solutions  at  low  cost.  It  is  advisable  to  ensure  that  the  right  hand  side  is 
in  the  chosen  subspace  and  this  consideration  leads  one  to  choose  the  subspace  .  In  this 
context  b  is  part  of  the  data  (A,  b)  and  is  not  at  our  disposal.  The  study  of  bad  b's  is 
relevant  to  a  study  of  the  complexity  of  Krylov  space  methods  for  linear  equations.  How¬ 
ever  it  has  been  appreciated  from  the  beginning  that  for  reasonable  e  and  unfortunate  b 
then  n  steps  will  be  required  unless  A  is  close  to  the  identity  matrix.  See  [Ka,1966,Thr.4.3] 
and  [Me,1963]. 

To  burden  the  Lanczos  algorithm  (or  GMR)  with  unnecessarily  bad  starting  vectors 
for  the  eigenvalue  problem  is  like  studying  the  performance  of  Olympic  athletes  only  when 
they  suffer  from  some  rare  affliction  like  poison  ivy. 

5.3  Redefining  the  Lanczos  Algorithm 

The  new  algorithm  GMR  is  contrasted  with  the  well  known  Lanczos  algorithm.  Here 
is  Kuczynski’s  definition  of  the  Lanczos  algorithm,  from  p.l42.  The  subspace  Aj  is  our 
Krylov  subspace  K-’ . 

‘Perform  the  following  steps. 

1.  Find  an  onhonormal  basis  q^,q2,...,qj  of  the  subspace  Aj-,  Let 
Qj  =  (q^  ...  ,qj)  be  the  nxj  matrix. 

2.  Form  the  jxj  matrix  Hj  =  Qj'AQj;  compute  eigenpairs  of  Hj-, 

~  ^8i,8m)  ~  ^im,  L  ~  >  J- 

3.  Compute  the  Ritz  vectors  z,  =  Qjg^  and  the  residual  r:^  =  min  ||i4z,-0  z  II  for 

^  ^  \<i<j 

1  <  /■  <  j. 

4.  Define  Zj  =  {(z,  0j),  i  =  1,  2 . j:  II  /Iz,  -Z;0,  II  = 

The  jth  step  of  the  L  algorithm  is  defined  by 

Oj^iNj{A,b))={x,,  Pk), 
where  (x^  p^)  is  an  arbitrary  element  from  Zj.’ 


The  trouble  is  that  steps  3  and  4  have  been  changed  from  the  usual  ones  to  conform 
with  the  idiosyncratic  goal  discussed  in  Section  5.1.  However,  no  mention  of  this  fact  is 
made. 

Here  is  the  conventional  description  wherein  it  is  supposed  that  p  eigenvalues  are  to 
be  approximated.  It  is  from  [Pa,1980,p.214] 

2.  ‘  Form  the  jxj  matrix  Hj  =  Q/AQj;  compute  the  p(<J)  eigenpairs  of  Hj  that 
are  of  interest,  say 

f^jgi  =  8A.  '  =  1.  .P 

The  0,-  are  Ritz  values,  0,  <  02 . <  9j  .  Equality  is  not  possible. 

3.  If  desired,  compute  the  p  Ritz  vectors  z,-  =  Qjgi,  i  =  I . p.  The  full  set 

{(0,-  z,),  j  =  l,  ...,  j}  is  the  best  set  of  j  approximations  to  eigenpairs  of  A  that 
can  be  derived  from  Aj  alone. 

4.  Residual  error  bounds. 

Form  the  p  residual  vectors  r,  =  Azi-z^di-  Each  interval  [0^-  II  II,  0^+  II  r;  II] 

contains  an  eigenvalue  of  A.  If  some  intervals  overlap  then  a  bit  more  work  is 

’■equired  to  guarantee  approximations  to  p  eigenvalues.  See  [Pa,l980,Sec.l  1-5]. 

5.  If  satisfied  then  stop.  ’ 

In  the  context  of  Kuczynski’s  investigations  his  modification  is  entirely  reasonable; 
i.e.  he  selects  at  each  step  one  Ritz  pair  with  a  minimal  residual  norm.  However,  it  is  most 
misleading  to  those  not  familiar  with  the  Lanczos  algorithm  to  suggest  that  its  purpose  is 
simply  to  produce  this  Ritz  pair.  In  fact,  as  indicated  above,  the  Lanczos  algorithm  pro¬ 
duces  an  approximation,  albeit  crude,  to  the  whole  spectrum,  namely  0^,  ... ,  Oj  and  the 

user  is  free  to  select  from  this  set  to  suit  the  specific  goal.  Thus  to  approximate  the  right¬ 
most  eigenvalue,  one  concentrates  on  9j  and  continues  until  its  error  bound  II  rj  II  is  satis¬ 

factory.  In  practice  more  refined  error  bounds  can  be  made  but  that  is  not  germane  here, 
see  [Pa&No,1985]. 

It  would  have  been  preferable  to  state  the  Lanczos  algorithm  conventionally  and  then 
specify  the  modifications  appropriate  for  the  purpose  in  hand.  This  action  would  make 
clear  that  the  Lanczos  algorithm  is  not  trying  to  minimize  one  residual  norm.  That  is  why 
it  is  inferior  to  GMR  for  that  purpose. 

It  is  worth  pointing  out  here  that  in  the  model  of  arithmetic  used  in  these  studies  the 
cost  of  all  the  Rayleigh-Ritz  approximations  and  of  finding  the  minimal  residual  norm  over 

the  subspace  is  taken  as  nil.  It  might  occur  to  the  reader  that  in  this  context  it  would 
cost  no  more  per  step  to  compute  all  the  Rayleigh-Ritz  approximations  and  use  whatever 
approximations  one  desires.  The  setting  up  of  GMR  and  Lanczos  as  competing  algorithms 
is  artificial.  Moreover,  in  practice,  it  is  much  more  expensive  to  compute  the  minimal  resi¬ 
dual  than  to  compute  the  Rayleigh-Ritz  residuals.  Kuczynski  has  devoted  a  whole  report  to 
the  task. 
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5.4  Theoretical  Results 

From  p.  138 

‘We  are  ready  to  formulate  the  main  theory  of  the  paper. 

Theorem  3.1 

If  F  is  unitarily  (orthogonally)  invariant,  then  the  GMR  is  almost  strongly 
optimal  in  F ,  i.e.,  A,  b)  =  min  A:(0,  A,  b)+a,  for  any  (A,b)  e  FxS^, 

where  a  e  ( U,  1 ,  2 } .  ’ 


Here  A,  b)  is  the  minimal  number  of  steps  j  required  to  guarantee  an  e-residual 

with  algorithm  <I>  over  all  matrices  A  that  are  indistinguishable  from  A  with  the  given 
information  N=Nj=[b,  Ab,  A^b].  The  algorithm  d>  returns  a  pair  p,  x  whose  residual 
norm  II  Ax-xp  II  is  to  be  compared  with  e. 

Recall  that  with  information  Nj,  the  GMR  algoritnm,  by  definition,  picks  out  a  unit 
vector  xe  and  a  pe  C  that  produce  the  minimal  residual  norm. 

How  could  any  other  algorithm  possibly  do  better?  Well,  there  might  be  special  cir¬ 
cumstances  in  which  one  could  deduce  a  suitable  additional  component  of  that  last  vector 

A^b  that  is  not  used  by  GMR  in  forming  x,  although  A^h  is  used  in  calculating  the 

coefficients  of  GMR’s  approximation  from  KK  The  proof  studies  this  possibility  and  con¬ 
cludes  that  GMR  would  make  up  any  discrepancy  in  at  most  2  more  steps. 

The  argument  is  very  nice.  In  many  important  cases,  when  A  is  Hermitian  for  exam¬ 
ple,  then  the  constant  a  in  Theorem  3.1  is  actually  0. 

There  are  other  clever  results.  Theorem  4.2  shows  that  for  symmetric  matrices  the 
residual  norm  of  GMR  must  be  strictly  decreasing  at  least  at  every  other  step.  Theorem  5.1 
yields  a  beautiful  but  esoteric  fact  about  Krylov  subspaces  generated  by  Hermitian  A .  For 
the  worst  starting  vector  there  is  a  unit  vector  v  in  such  that 


Mli 

2j 


<  ||/lv-vp||  < 


Mi 

J 


for  j<n. 

As  indicated  above  these  nice  results  from  approximation  theory  do  not  add  up  to  a 
case  for  replacing  Rayleigh-Ritz  with  some  rival  algorithm. 

5.5  Numerical  Examples 


All  the  numerical  results  reponed  in  [Ku,1986]  concern  symmetric  tridiagonal 
matrices  with  starting  vector  ej  (  the  first  column  of  the  identity  matrix  /).  This  starting 
vector  ensures  that  the  Lanczos  algorithm  reproduces  the  original  matrix.  At  this  point  we 
should  recall  that  the  original  goal  of  the  Lanczos  algorithm  was  to  reduce  a  symmetric 
matrix  to  tridiagonal  form.  So  the  numerical  results  to  be  seen  below  do  not  relate  to  the 
Lanczos  recurrence  itself  but  merely  indicate  alternative  rules  for  stopping.  With  the  goal 
of  tridiagonalization  the  algorithm  always  stopped  at  step  n.  Later  it  was  realized  that 
excellent  approximations  to  a  few  eigenvectors  were  usually  obtained  early  in  the  process. 
There  is  no  single  stopping  criterion  for  current  Lanczos  algorithms;  termination  depends 
on  what  the  user  wants,  see  [Cu&Wi,1985],  [Pa,1980]  and  [Go&Va,1984]. 
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Kuczynski  provides  the  Lanczos  algorithm  with  a  stopping  criterion  to  suit  his  pur¬ 
poses  but  his  algorithm  GMR  could  have  been  called  (with  more  justice)  the  Lanczos  algo¬ 
rithm  with  a  new  stopping  criterion.  It  uses  the  minimum  residual  in  the  whole  Krylov  sub¬ 
space  instead  of  the  usual  (cheap)  Rayleigh-Ritz  approximations.  So  the  numerical  results 
simply  indicate  the  effect  of  these  different  termination  criteria. 

The  first  batch  of  results  concern  tridiagonals  with  nonzero  elements  chosen  at  random 
from  [-1/3,  1/3].  The  most  striking  feature  is  that  the  GMR  residual  and  the  smallest 
Rayleigh-Ritz  residual  slip  below  the  given  e  at  the  same  step  in  the  vast  majority  of  cases, 

panicularly  for  e  <  10"^.  In  Table  8.1,  with  e  =  10"^,  the  step  was  the  same  in  18  out  of 
20  cases.  In  the  other  two  the  Lanczos  algorithm  (i.e.  the  minimal  R-R  norm)  took  1  more 
step  (17  as  against  16). 

Some  weight  is  given  to  the  fact  that  the  smallest  R-R  residual  norm  is  rarely  mono¬ 
tone  decreasing  from  one  step  to  another  whereas  GMR  does  enjoy  this  property.  However 
if  the  approximate  eigenvalue  associated  with  the  minimum  residual  happens  to  change 
position  in  the  spectrum  from  step  to  step  then  this  monotonicity  of  GMR  is  not  associated 
with  the  convergence  to  a  specific  eigenvalue  of  the  original  matrix.  No  indication  is  given 
in  the  results  of  how  the  approximate  eigenvalue  implicitly  chosen  by  GMR  jumps  around 
the  spectrum.  In  practice  the  interesting  thing  to  know  is  how  many  Ritz  values  have  ‘con¬ 
verged’,  and  to  what  accuracy,  when  the  algorithm  is  terminated.  Unfortunately  this  infor¬ 
mation  is  excluded  from  the  GMR  viewpoint  and  is  not  reported. 

The  next  results.  Examples  8.1  and  8.2  in  [Ku,1986],  exhibit  the  dramatic  ‘failure’  of 
the  Lanczos  algorithm.  On  a  tridiagonal  matrix  of  order  201  and  norm  near  1  the  minimal 
R-R  residual  remained  at  its  initial  value  0.035  for  all  steps  except  the  last  (at  which  it 
must  be  0).  In  contrast  the  GMR  residual  declined  slowly  from  the  initial  0.035  to  0.0039 
at  step  200.  If  e  =  0.034  then  GMR  takes  2  steps  while  Lanczos  takes  201!  However  with 
e  <  10“^  both  algorithms  need  201  steps.  We  repeat  once  again  that  GMR  will  not  know 
which  eigenvalue  it  has  approximated. 

Unfortunately  no  attempt  is  made  to  put  this  example  in  context.  It  illustrates  the 
phenomenon  explored  in  some  detail  in  [Sc,  1979],  namely  that  for  every  symmetric  matrix 
with  distinct  eigenvalues  there  is  a  set  (with  positive  Lebesgue  measure  on  the  sphere)  of 
starting  vectors  such  that  no  Rayleigh-Ritz  approximation  is  any  good  until  the  last  step. 
We  must  repeat  that  the  Lanczos  algorithm  is  not  obliged  to  use  a  poor  initial  vector.  We 
ran  our  Lanczos  code  on  this  matrix.  Our  code  starts  with  a  normalized  version  of  Ar, 
where  A  is  the  given  matrix  (or  operator)  and  r’s  elements  are  chosen  at  random  from  a 
uniform  random  distribution.  The  reason  for  starting  with  Ar  is  compelling  when  sym¬ 
metric  A  has  an  unwanted  null  space.  The  results  are  given  in  the  Table  1. 

The  accepted  eigenvalues  (104  of  them  at  step  190)  agreed  with  those  computed  by 
EISPACK  to  all  of  the  15  decimals  printed  out.  The  efficiency  is  not  at  all  bad  considering 
that  this  is  a  difficult  eigenvalue  distribution  for  Krylov  space  methods. 

Example  8.2,  a  tridiagonal  of  order  501  with  null  diagonal  and  monotonely  increasing 
off  diagonal  elements,  caused  the  minimal  R-R  residual  norm  to  increase  from  0.001  ini¬ 
tially  to  0.011  at  steps  499  and  500.  In  contrast  GMR  residual  norms  declined  to  0.00036 
at  steps  499  and  500.  Thus  with  e  =  .00099  GMR  terminates  at  step  2  whereas  Lanczos 
terminates  at  step  501!  However  with  e  <  lO"’*  both  take  501  steps. 
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As  with  Example  8.1  Cj  is  a  bad  starting  vector  yielding  a  poor  Krylov  subspace.  We 
ran  our  Lanczos  program  and  found  the  results  given  Table  2. 

We  quote  the  final  paragraph  of  the  article. 

‘From  all  the  tests  we  have  performed  we  conclude  that  the  GMR  algorithm  is 
essentially  superior  to  the  Lanczos  Algorithm  on  matrices  with  constant  or 
increasing  codiagonal  elements.  For  random  matrices  or  matrices  with  decreasing 
codiagonal  elements,  both  algorithms  produce  nearly  the  same  residuals.’ 

The  revealing  word  here  is  ‘codiagonal’.  The  author  has  worked  exclusively  with  tri- 
diagonal  matrices  and  has  forgotten  that  the  goal  of  the  Lanczos  recurrence  is  to  produce  a 
tridiagonal  matrix!  Given  such  a  matrix  one  has  no  need  of  either  Lanczos  or  GMR.  As 
our  results  indicate  a  random  starting  vector  permits  the  Lanczos  algorithm  to  perform 
satisfactorily  even  on  such  craftily  designed  matrices.  The  quotation  reveals  just  how  far  a 
mathematical  theory  can  stray  from  relevance. 

5.6  Summary 

Here  is  an  attempt  to  formulate  the  numerical  analyst’s  version  of  Complexity  Theory 
for  Krylov  subspaces  and  eigenvalues. 

For  each  symmetric  nxn  matrix  there  are  initial  vectors  that  yield  an  eigenvalue 
in  one  step,  and  initial  vectors  that  yield  an  eigenvalue  only  at  the  nth  step.  The 
nontrivial  result  contained  in  the  Kaniel-Paige-Saad  error  bounds  (See 
[Pa,1980,Chap.l2].)  is  that  with  most  starting  vectors  the  extreme  eigenvalues 
can  be  formed  in  a  modest  number  of  steps  that  depends  on  the  distribution  of 
the  spectrum  and  is  nearly  independent  of  n. 

In  brief  our  criticism  of  [Ku,1986]  is  as  follows. 

Section  1  exposes  a  serious  flaw  in  the  model,  namely  the  goal. 

Section  2  exposes  a  subtle  way  in  which  features  of  a  method  are  pushed  into  the  problem 
statement;  the  starting  vector. 

Section  3  shows  how  standard  terms  can  be  redefined;  the  Lanczos  algorithm. 

Section  4  contains  some  nice  results  on  the  approximating  power  of  Krylov  subspaces. 
Section  5  shows  how  very  misleading  numerical  results  can  be  in  the  absence  of  proper 
interpretation. 
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Table  1:  Convergence  of  Ritz  pairs  on  ^201 


STEP 

e 

Number  of 

good  Ritz  values 

10-2 

1 

■■ 

10-3 

1 

40 

10-3 

4 

50 

10-3 

5 

60 

10-^  ,  10-^ 

7.4 

70 

10-3  ,  10-2 

10,  5 

80 

10-3  .  10-2 

14,  10 

90 

10-3  ,  10-2 

18.  14 

100 

10-3  ,  10-2 

23,  18 

110 

10-3  ,  10-2 

28,  24 

120 

10-3  ,  10-2 

35,  30 

130 

10-3  ,  10-2 

44,  37 

140 

1 

O 

1 

O 

52,  44 

150 

10-3  ,  10-2 

60,  54 

160 

10-3  ,  10-2 

70,61 

170 

10-3  ,  10-2 

83,75 

180 

10-3  ,  10-2 

99,  89 

190 

10-3  ,  10-2 

118,  109 
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Table  2;  Convergence  of  Ritz  pairs  on  7501 


STEP 

£ 

Number  of 

good  Ritz  values 

Bl 

10-2 

1 

■■ 

10-2 

1 

40 

10-2 

2 

50 

10-^ 

3 

60 

10-^  ,  10-^ 

5,2 

70 

10-2  ,  10*2 

7.3 

80 

10-2  ,  10-2 

9.6 

90 

10-2  ,  10-2 

13.  10 

100 

10-2  ,  10-2 

16,  13 

no 

10-2  ,  10-2 

20,  15 

120 

10-2  ,  10-2 

23,  20 

130 

10-2  ,  10-2 

29.  23 

140 

10-2  ,  10“ ' 

34,  29 

150 

10-2  ^  jq-7 

39,  34 
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